logo



Before we begin please note anything bolded is either a finding or an important comment

Introduction


This blog illustrates the use of R to analyze COVID-19 Outbreak in Peru by applying different algorithms in an attempt to gauge the spread of the virus and success of public health containment efforts in reducing the speed and degree of spread.

For this analysis, I’m laboring Tim Churche’s blog which containings a detailed analysis of COVID-19 data for China.

This analysis is for informational purposes only and should not be relied upon for to inform governmental or health care policy.

Data sources


Data utilized for this analysis comes from Coronavirus wiki, Peruvian Ministry of Health MINSA, Outbreak Dashboard and nCOV19 package. The use of these pages through web scraping can be challenging, however, because the format of the tables in which the data appears changes many times per day.

Data limitations


The biggest problem with these analyses is that they are not based on data tabulated by date of onset (of symptoms), but rather on data tabulated by date of reporting, or perhaps date of confirmation by lab test. We understand that lab results from molecular tests can take up to 7 days to be published while lab results from serological tests are not reliable due that test looks for antibodies within the blood stream which are usually developed from 7 to 30 days after contracting the virus.

If lab testing and reporting is swift and the lag is uniform, then the date of reporting or confirmation will lag consistently and only slightly behind the date of onset. Sadly that’s not usually the case, there are tipacally variable delays in lab confirmation and/or reporting, leading to backlogs of cases being reported all at once (known as a “data dump” by outbreak epidemiologists). This is very unfortunate, and it frustrates efforts to properly assess outbreak trajectories and effectiveness of intervention strategies. One solution would be for authorities to provide line-listings of all cases, including their date of reporting and their date of onset of symptoms, even if only presumptive dates are provided (and missing dates of onset can be imputed, in any case).

The effect of the unavoidable (because that’s all that is available) use of data tabulated by date of reporting here (and just about everywhere else) means that reproduction numbers and related measures of infectiousness may be inflated, particularly at first. Nonetheless, the methods illustrated here show the potential utility of epidemic trajectory analysis – but that utility would be so much greater if health authorities published data tabulated by date of onset, or better, line-listings that include date of onset. It would be very easy for them to do that.

Data Update


This initial analysis is meant to visualize and forecast COVID-19 spread in Peru.

Peru COVID-19 dataset was updated on: 13May20

Analysis methods


The analysis presented here primarily uses R packages developed and published by the R Epidemics Consortium (RECON).

The functions provided by the RECON incidence package are used to fit log-linear models to the growth and decay phases of the outbreak in Peru, and the decay phase model is used to very approximately predict when local transmission will be extinguished there.

I also utilize earlyR and EpiEstim packages, which are also published by RECON. In particular the get_R() function in earlyR calculates a maximum-likelihood estimate (from a distribution of likelihoods for the reproduction number) as well as calculating λ, which is a relative measure of the current “force of infection”:


$$ \lambda = \sum_{s=1}^{t-1} {y_{s} w (t - s)} $$

where w() is the probability mass function (PMF) of the serial interval, and ys is the incidence at time s.

The resulting λ “force of infection” plot indicates the daily effective infectiousness (subject to public health controls), with a projection of the diminution of the force of infection if no further cases are observed. The last date of observed data is indicated by the vertical blue dashed line. New cases are shown in a cumulative manner as black dots. It is a sign that the outbreak is being brought under control if λ, as indicated by the orange bars, is falling prior to or at the date of last observation (as indicated by the vertical blue line). Note that left of the vertical blue line the λ values are projections, valid only in no further cases are observed. As such, the plot is a bit confusing, but it is nonetheless useful if interpreted with this explanation in mind. The RECON packages are all open-source, and easier-to-interpret plots of λ could readily be constructed.

The critical parameter for these calculation is the distribution of serial intervals (SI), which is the time between the date of onset of symptoms for a case and the dates of onsets for any secondary cases that case gives rise to. Typically a discrete γ distribution for these serial intervals is assumed, parameterised by a mean and standard deviation, although more complex distributions are probably more realistic.

For the estimation of the force of infection λ, for the serial interval we’ll use a discrete γ distribution with a mean of 5.0 days and a standard deviation of 3.4.


Data Understanding


After having an understanding on what is our business problem, initial datasets are loaded for data preparation and exploration prior model development.

Libraries & Functions


Let’s begin by loading the libraries and functions that will be used for this analysis:

##                 plyr                dplyr           data.table 
##                 TRUE                 TRUE                 TRUE 
##               tibble              stringr              stringi 
##                 TRUE                 TRUE                 TRUE 
##             forecast            tidyverse                broom 
##                 TRUE                 TRUE                 TRUE 
##                   gt                caret                  zoo 
##                 TRUE                 TRUE                 TRUE 
##               plotly                   DT                tidyr 
##                 TRUE                 TRUE                 TRUE 
##                rvest           hrbrthemes            inspectdf 
##                 TRUE                 TRUE                 TRUE 
##         DataExplorer              ggplot2           tsoutliers 
##                 TRUE                 TRUE                 TRUE 
##             ggthemes                  dlm PerformanceAnalytics 
##                 TRUE                 TRUE                 TRUE 
##          projections               earlyR             ggdendro 
##                 TRUE                 TRUE                 TRUE 
##                rgdal              tsibble             EpiEstim 
##                 TRUE                 TRUE                 TRUE 
##              epitrix            distcrete           shadowtext 
##                 TRUE                 TRUE                 TRUE 
##             nCov2019             magrittr            incidence 
##                 TRUE                 TRUE                 TRUE

Data Collection


Raw data for each Peru COVID-19 dataset are loaded.

Summary


The COVID-19 dataset (COVID_19_Peru_Raw) has 67 rows and 12 columns while COVID-19 dataset case increase every 2 days has 23 rows and 3 columns

Worldwide Analytics


Please see below for COVID-19 Outbreak Worldwide Analytics updated on 12May20.

Outbreak trend since 100th cases


Here we observe Worldwide Outbreak trend after the 100th reported possitive case:

Geographic Cumulative Cases


Below we find a geographic heatmap based on possitive confirmed cases up to date:

Epidemic Heatmap for top 25 countries on the last 7 days


Heatmap of epidemic situation around the world in the last 7 days (log scale) for top 25 countries with higher number of possitive cases.

Peru Analytics

Exploratory Data Analysis (EDA)


After raw data is loaded, I proceed to perform EDA on dataset.

Data Cleaning


For this step, I perform a data cleaning if needed.

Daily cumulative incidence


First, let’s look at the daily cumulative number of incident, lab-confirmed cases for Lima, for all the other provinces combined, and for all of Peru.

The initial increases for Lima and the balance of the other provinces, and for all of Peru look to be approximately sigmoidal, as is expected for epidemic spread. Let’s plot them on a logarithmic y axis. We would expect to see a linear increase on a log scale if the epidemic curve is indeed exponential.

One could convince oneself that log-linearity is present.

Daily incremental incidence


Let’s also look at the daily incremental incidence. This is more informative, and is known in outbreak epidemiological parlance as the epidemic curve. It is traditionally visualised as a bar chart, which emphasises missing data more than a line chart, even one with points as well. We’ll adhere to epidemiological tradition.

At the time of writing (13May20), it looks like incidence has yet to plateaued in Peru.

Daily cumulative and incremental deaths in lab-confirmed cases


Now let’s look the daily (incremental) number of deaths in lab-confirmed cases for all provinces in Peru combined.

Let’s also look at the daily incremental deaths in lab-confirmed cases.


Clearly daily counts of deaths are continuing to rise despite the fact that the daily count of new cases is now falling. This is not surprising, given that it takes some time for cases to either recover or to die, and therefore the trend in deaths will necessarily lag behind any trend in daily incidence.

Fitting an SIR model to the Peru data


On 4 February 2020, data science blogger Learning Machines posted this analysis of the COVID-19 outbreak, in which he fitted the classic SIR (Susceptible-Infectious-Recovered) model to the incidence data for all of China. I’ll use his explanation of how to fit this model using R as based for this analysis.

The basic idea behind the SIR model of communicable disease outbreaks is that there are three groups (also called compartments) of people: those who are healthy but susceptible to the disease S, the infectious (and thus, infected) I and people who have recovered R:

Source: wikipedia

Source: wikipedia

To model the dynamics of the outbreak we need three differential equations, to describe the rates of change in each group, parameterised by β which controls the transition between S and I and γ which controls the transition between I and R:


$$\frac{dS}{dt} = - \frac{\beta I S}{N}$$


$$\frac{dI}{dt} = \frac{\beta I S}{N}- \gamma I$$


$$\frac{dR}{dt} = \gamma I$$

The first step is to express these differential equations as an R function, which is easier than one might think – the expressions in the code are just direct translations of the differential equations, with respect to time t of course.

To fit the model to the data we need two things: a solver for these differential equations and an optimiser to find the optimal values for our two unknown parameters, β and γ. The function ode() (for ordinary differential equations) from the deSolve package for R makes solving the system of equations easy, and to find the optimal values for the parameters we wish to estimate, we can just use the optim function built into base R. Specifically what we need to do is minimise the sum of the squared differences between I(t), which is the number of people in the infectious compartment I at time t, and the corresponding number of cases as predicted by our model (t). This quantity is known as the residual sum of squares (RSS)1.


RSS(β, γ) = ∑t(I(t)−(t))2

I now proceed to fit a model to the incidence data for all of Peru. We need a value N for the initial uninfected population. Once again we’ll scrape population data from a suitable wikipedia page.

The approximate population of Peru in 2017 was 31,237,385 people, according to this wikipedia page.

Next, we need to create a vector with the daily cumulative incidence for Peru, from 6th March when our daily incidence data starts, through to current date. We’ll then compare the predicted incidence from the SIR model fitted to these data with the actual incidence since 6th March. We also need to initialise the values for S, I and R.

Then we need to define a function to calculate the RSS, given a set of values for β and γ.

Finally, we can fit the SIR model to our data by finding the values for β and γ that minimise the residual sum of squares between the observed cumulative incidence and the predicted cumulative incidence. We also need to check that our model has converged, as indicated by the message shown below:

## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
##           [,1]      [,2]
## [1,] -7.379413 -7.378935
## [2,] -7.378935 -7.378456

Convergence is confirmed. Now we can examine the fitted values for β and γ.

##      beta     gamma 
## 0.4670875 0.2635239

Those values don’t mean a lot, per se, but let’s use them to get the fitted number of people infected to our SIR model for the dates up to 07Apr20, which is the latest day that Peru published molecular test results without serological tests, and compare those fitted values with the observed data.

That looks like a reasonably good fit to the observed cumulative incidence data, so we can now use our fitted model to calculate the basic reproduction number R0 which gives the average number of susceptible people who are infected by each infectious person:


$$R_{0} = \frac{\beta}{\gamma}$$

That’s very easy to calculate, and we get:

##       R0 
## 1.772467

An R0 > 1.5 is consistent the values calculated by others for COVID-19, and is also consistent with the R0 for SARS and MERS, which are similar diseases also cause by coronavirus.

Using the SIR model for Peru to make predictions


An obvious next step is to use our fitted SIR model to make predictions about the future course of the outbreak. However, caution is required, because the SIR model assumes a fixed reproduction number, but if public health interventions have been implemented, such as quarantining of cases, contact tracing and isolation of those contacts, and general restrictions on social mixing, such as closing the country, then the effective reproduction number Re will be dynamic and should fall as those interventions are progressively implemented, to values considerably less than the basic reproduction number R0, which reflects the behaviour of the virus at the beginning of an epidemic before any response has been implemented.

So let’s use our SIR model, fitted to the first 67 days of data, to extrapolate out to the current date, and compare that against the observed values:

We can see that the actual incidence similar than that predicted by our model. The reason is that, even with public health interventions implemented by the Peruvian authorities, the Re of the COVID-19 in Peru hasn’t been reduced by much. This is something that will need to be addressed ASAP.

When the Re falls below 1.0, the peak of the epidemic will have been reached and the outbreak will eventually die out.

Using SIR model fitted for Peru to analyze outbreak


It is instructive to use our model fitted with up to date lab-confirmed cases data, to see what would happen if the outbreak were left to run its course, without public health interventions.

It is easier to see what is going on if we use a log scale:

I’s clear that, if outbreak follows model above, it would ended up quite bad due that Peruvian Health System is not design to support this pandemic.

Additionally, we observe that model above, which is based on molecular test results only, differs from official peruvian numbers mainly due that official numbers are a mix of serological and molecular test results which may lower accuracy on official numbers.

At this point, it is worth remarking on the importance of decisive public health intervention to limit the spread of such epidemics. Without such interventions, tens of millions of people could be infected, as our model predicts, and even with only a one or two per cent mortality rate, hundreds of thousands of people would die.

Epidemic trajectory model using log-linear models

As noted above, the initial exponential phase of an outbreak, when shown in a semi-log plot (the y-axis with a logarithmic transform), looks somehow linear. This suggests that we may be able to model epidemic growth and decay using a simple log-linear model:


log(y) = rt + b

where y is the incidence, r is the growth rate, t is the number of days since a specific point in time (typically the start of the outbreak), and b is the intercept. Separate models are fitted to the growth and the decay parts of the epidemic (incidence data) curve.

I proceed to find the peak of confirmed possitive cases based on current available data.

Now I proceed to fit a log-linear model for the growth phase before the peak. I can plot the fitted values from our model (with confidence limits) on top of the actual observed possitive incidence data for Peru.

And here I will evaluate force-of-infection λ plot

Model Fitted Statistics
Last update on: 2020-05-13
Dates R2 Adj. R2 Deviance
all 0.88 0.88 40.52

From the model, we can extract various parameters of interest: the growth rate prior to the peak was 0.11 (95% CI 0.099 - 0.12).

This growth rates is equivalent to a doubling time of 6.4 days (95% CI 5.8 - 7.0 days).

The doubling time estimate is quite ilustrative on informing current public health intervention policy in Peru.

Assessment: Outbreak in Peru is not in control yet and may still growth if public health policies are not followed.

Estimating the reproduction number from log-linear models


Based on log-linear model of the epidemic trajectory I’m able to estimate the reproduction number in the growth phase of the epidemic. I need to provide a distribution for the serial interval time, which is the time between the onset of a primary case and the time of onset in its secondary cases. I’ll use a mean of 7.5 days and a standard deviation of 3.4 days to parameterise a discrete gamma distribution for the serial interval based on analysis of just 5 primary cases amongst the first 450 cases in Wuhan, published by Li et al.. Here is a histogram of our calculated distribution of possible values for R0 for the growth phase, based on the log-linear model we fitted to the Peru incidence data:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.905   1.968   2.018   2.015   2.057   2.129

Note that the central estimates for R0 are considerably higher than those we calculated with a SIR model fitted to the same Peru data, and are also higher than R0 estimates published elsewhere, but they are consistent with estimates of the instantaneous effective reproduction number Re which are calculated in the next section2.

Estimating changes in the effective reproduction number


It would be useful to estimate the current effective reproduction number Re on a day-by-day basis so as to track the effectiveness of public health interventions in Peru, and possibly predict at the earliest opportunity when an outbreak will turn the corner.

There are several available methods for estimating Re – . Here I will focus on one method, developed in 2013 by Anne Cori and colleagues at Imperial College, London, which permits estimation of the instantaneous effective reproduction number, which is exactly want we want in order to track the effectiveness of containment efforts. Full details are available in the original paper on the method, with extensions described in a later paper by Thompson et al..

Once again, I’ll start with the counts of lab-confirmed posstive cases in Peru from 06th March 2020 onwards.

A critical parameter for the Cori model is the serial interval (SI). The SI is the time between onset of symptoms of each case of the disease in question, and the onset of symptoms in any secondary cases that result from transmission from the primary cases. In other words, it is the time between cases in the (branching) chain of transmission of the disease. A moment’s reflection reveals that the SI is, in fact, a statistical distribution of serial interval times, rather than a fixed value. That distribution can be simulated, typically using a discrete gamma distribution with a given mean and standard deviation.

There are different models that allows uncertainty to be incorporated into the parameterisation of this distribution, and it even allows the SI distribution to be estimated empirically, using Bayesian methods, from individual line-listings of cases. We’ll examine all of those capabilities.

I’m currently using one published estimate of the serial interval distribution, derived from analysis of just 5 primary cases amongst the first 450 cases in Wuhan, published by Li et al.. They estimate the serial interval distribution to have a mean of 7.5 days with a standard deviation of 3.4 days. This is almost identical to the serial interval parameters for the MERS virus, which were a mean of 7.6 and a SD of 3.4. This similarity is not surprising given that both pathogens are coronavirii. Let’s use that to estimate the instantaneous Re for Peru.

The first thing to note is that the slope of the effective reproduction number curve have yet to show a prominent downward shift, which strongly suggests that containment efforts are yet to succeed in reducing transmission of the disease in Peru.

The second thing to note is that the 7-day sliding window estimates of instantaneous Re are very high, approaching 15 at the peak. This seems unlikely. It is possible that the Cori model is flawed but has been shown to accurately estimate Re using a wide range of historical outbreak data. There are, however, a number of possible alternative explanations.

One possible explanation is that COVID-19 is transmissible before the onset of symptoms, resulting in much shorter serial intervals than expected, possibly shorter than the incubation period. Alternatively, and very likely, there may be non-symptomatic, sub-clinical spreaders of the disease, who are undetected. Again, the effect is as if the serial interval is very short, although it would be desirable to explicitly model that scenario, but current methods don’t permit that.

To cover the possibility of some cases transmitting the disease very soon after infection, possibly before the onset of symptoms (so-called super-spreaders), and some cases being sub-clinical, and thus undetected, spreading the disease as well, while other cases have a serial interval more consistent with that of MERS or SARS, with a mean around 8 days, I proceed to incorporate this uncertainty around the serial interval distribution by allowing specification of a distribution of distributions of serial intervals. So let’s retain the mean SI estimated by Li et al. of 7.5 days, with an SD of 3.4, but let’s also allow that mean SI to vary between 2.3 and 8.4 using a truncated normal distribution with an SD of 2.0. We’ll also allow the SD or the SI to vary between 0.5 and 4.0.

Recalculating with those SI distribution parameters and meta-parameters, we get these results:



Analysis above looks more reasonable, and clearly the Re is not falling just yet in Peru.

Let’s use data above to re-estimate Re. Bayesian methods are used, and the trace output below is from the MCMC (Markov-chain Monte Carlo) resampling methods used.

That is remarkably similar to our estimates based on an “uncertain” meta-distribution for the serial interval, above, which is encouraging.

Nonetheless some of the Re values are still rather high. Remember, these are instantaneous reproductive numbers, averaged over a sliding 7-day window. Let’s change that to a shorter window to see what the day-to-day changes in Re are, rather than based on an aggregation of the preceding 7-days. I’ll plot the results using a logarithmic scale with a red reference line at 1.0, representing the reproduction number below which the outbreak will start to die out.

So, it looks like the outbreak has yet to been brought under control in Peru, at least based on the published lab-confirmed possitive case numbers and the available serial interval data or estimates of its distribution.

It is clear that control measures have yet to work by looking at the daily incremental incidence numbers for Peru:

Using other models to understand outbreak in Peru


Here I will use a different approach such as linear and GLM regressions to fit COVID-19 outbreak in Peru:

## PERU  --  72059 
## ===============================   running models...=============================== 
##   Linear Regression (lm): 
## 
## Call:
## lm(formula = y.var ~ x.var)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15373  -9918  -2261   7232  39127 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -13803.85    2378.98  -5.802 6.39e-08 ***
## x.var          417.28      36.55  11.418  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12500 on 110 degrees of freedom
## Multiple R-squared:  0.5424, Adjusted R-squared:  0.5382 
## F-statistic: 130.4 on 1 and 110 DF,  p-value: < 2.2e-16
## 
## -------------------------------------------------------------------------------- 
##   Linear Regression (lm): 
## 
## Call:
## lm(formula = y.var ~ x.var)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0462 -0.6442  0.3737  0.6572  2.5485 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.678564   0.234458  -11.42   <2e-16 ***
## x.var        0.130109   0.003602   36.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.232 on 110 degrees of freedom
## Multiple R-squared:  0.9223, Adjusted R-squared:  0.9216 
## F-statistic:  1305 on 1 and 110 DF,  p-value: < 2.2e-16
## 
## -------------------------------------------------------------------------------- 
##   GLM using Family [1] "poisson" : 
## 
## Call:
## glm(formula = y.var ~ x.var, family = family)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -68.935  -23.938   -9.991   -1.967   31.896  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 1.578e+00  8.582e-03   183.9   <2e-16 ***
## x.var       8.797e-02  8.433e-05  1043.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2873086  on 111  degrees of freedom
## Residual deviance:   53521  on 110  degrees of freedom
## AIC: 54171
## 
## Number of Fisher Scoring iterations: 5
## 
## --------------------------------------------------------------------------------

Here we can see the growth rate in Peru:

Finally, here we can observe daily trends based on current official data

##   Processing  Peru  ...

Projections


After creating projections based on a simple SIR model, it is possible to use a more complex model to generate a sophisticated projection as well.

Here below I created a projection based up to 07 May 2020 data. Projection extends up to 14 days from last observed possitive case.


Based on that projection, it will take until May for the outbreak in Peru to be extinguished.


  1. It is also possible to fit SIR and related models by MLE.

  2. They are also consistent with unpublished estimates of R0 for the COVID-19 virus being discussed in various expert in WHO.

 




Created by Alexis Roldan